Managing metadata variability within a hierarchy of annotation schemas

Ionuţ Cristian Pistol¹, Dan Cristea¹,²

¹ Faculty of Computer Science, University “Al. I. Cuza” of Iaşi, Romania
{ipistol, dcristea}@info.uaic.ro
² Institute for Computer Science, Romanian Academy, Iaşi, Romania

Abstract. The paper describes the theoretical basis of the ALPE¹ model, a hierarchy of annotation formats used to guide the automatic computation of processing flows capable of performing complex linguistic processing tasks. The hierarchy comprises a core, which is a directed acyclic graph whose nodes represent XML annotation formats, and a halo, which contains additional annotation formats. The core hierarchy also serves as a standardization hub for annotated documents. The focus of the paper is the description of the new additions to the model, which allow the integration and usage of non-XML formats in processing flows and introduce new equivalence relations between XML formats. This type of hierarchy is implemented in the ALPE system, which is also briefly described in the paper.

1 Introduction

In recent years, the field of Natural Language Processing has witnessed a significant effort concerning the standardization and usability of the processing tools and resources being developed. Projects such as CLARIN² and FLaReNet³, among others, intend to offer both developers and users of language resources and tools a management solution for the growing set of available resources. The primary objectives of these projects are to make existing resources reusable in new contexts and to guarantee maximum visibility and reusability for newly developed resources. Easily widening the original setting in which a tool is used multiplies the visibility of that tool and, ultimately, the productivity of the research activity.

In terms of managing linguistic processing tools, previous efforts led to the development of linguistic processing meta-systems, the most significant ones being
GATE⁴ [3] and UIMA⁵ [4]. Both systems allow access to a set of independently developed NLP tools, integrated into an environment that offers the means to create and use processing chains which add linguistic metadata to an input corpus. They allow the integration of new tools and new processing chains, but these functionalities require programming experience.

¹ Automated Linguistic Processing Environment
² http://www.clarin.eu/
³ http://www.flarenet.eu/
⁴ http://www.gate.ac.uk/
⁵ www.research.ibm.com/UIMA/

ALPE is another such system, intended to define a framework which facilitates the integration of LT tools of different origins. In ALPE, linguistic resources and tools are positioned with respect to a directed acyclic graph whose nodes are annotation schemas and whose hierarchical links are subsumption relations between schemas. Processing tools are attached to the edges of the graph, and ALPE uses these tools to compute processing chains once specific pairs of start and destination formats are given. ALPE offers several advantages over existing systems with a similar goal: it is able to identify the annotation format of the input file, to automatically compute the processing steps required to bring the input file to the required output format and, eventually, to run this chain if the cost and IPR conditions are fulfilled.

Section two of this paper briefly describes the base ALPE hierarchy, and section three describes the enhancement of the base hierarchy with processing power and “clouds” of equivalent formats. The conclusions, as well as the planned further developments, are presented in section four.

2 The base hierarchy

The following section briefly describes the base of the ALPE hierarchy model, which is the extended hierarchy of annotation schemas. For a more detailed and commented description, please consult [1, 2].

2.1 The core hierarchy

The basis of the core hierarchy is a directed acyclic graph (DAG) which organizes the metadata of linguistic annotation in a hierarchy of XML schemas. Nodes in
this graph are called core nodes. Each core node corresponds to a single XML annotation format. In the following, we will consider a core node equivalent to an XML annotation schema and will occasionally refer to it as such. We denote by $T(A)$, where A is a core node, the set of elements (tags) defined in the XML annotation format corresponding to the core node A. We denote by $t_a(A)$, where A is a core node and $t \in T(A)$, the set of attributes belonging to the element $t$ as it appears in the core node A.

Edges connecting core nodes are called core edges. If there is a core edge linking a core node A with a core node B (we also say that A formally subsumes B, written $A \mathrel{s} B$), then the following conditions hold simultaneously:
- any element (tag-name) of A is also in B: $T(A) \subseteq T(B)$;
- any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name in B: $t_a(A) \subseteq t_a(B)$ for all $t \in T(A)$.

[Figure 1: An ALPE core hierarchy, with core nodes A, B, C and D]

For the hierarchy in Figure 1, node A subsumes nodes B, C and D, and node B subsumes node D.

As such, a hierarchical relation between a core node A and one of its descendants B describes B as an annotation schema which is more informative than A. In general, B has at least one tag-name which is not in A, and/or there is at least one tag-name in B such that at least one attribute in its list of attributes is not in the list of attributes of the homonymous tag-name in A. The direction of the core edge connecting nodes A and B is given by the subsumption relation, with the subsuming node being the origin of the core edge and the subsumed node its destination.

Theorem: The subsumption relation $s$ is a partial order relation on the set of core nodes.

2.2 The haloed hierarchy

In addition to the core nodes and edges, which strictly observe the specified restrictions, we can include in an extension to the core hierarchy, called a halo, other types of nodes and edges, such as:
- Nodes representing other
annotation formats than XML. We can consider each node in the core hierarchy as representing not just a specific format, but rather a class of annotation formats whose representative is an XML format. This is the case when a variety of annotation formats used in practice are all intended to encode the same information, but in different ways. These formats can be represented in the hierarchy as halo nodes;
- Edges originating or ending in halo nodes. These edges can either originate or end in the core hierarchy, or they can lie completely outside the core hierarchy. The semantic value of these edges is to mark the semantic subsumption between the source and destination nodes, a relation considered at an abstract level, as opposed to the formalized subsumption relation. Semantic subsumption means that the information encoded in the origin node’s format is part of the information encoded in the destination node’s format, but this inclusion cannot be strictly formalized using XML elements and attributes.

[Figure 2: A full ALPE hierarchy (core and halo), with core nodes A, B, C, D and halo nodes E, F, G, H]

The core hierarchy and the halo form a full hierarchy. The halo nodes and edges expand the core hierarchy outside the limits of XML, allowing other types of annotation formats to be represented in the full hierarchy. Figure 2 shows an example of a full hierarchy based on the core hierarchy in Figure 1. Core nodes and edges (and thus the core hierarchy) are marked with continuous lines, and halo nodes and edges with interrupted lines. Nodes A, B, C and D are core nodes, and nodes E, F, G and H are halo nodes.

As for the core edges, the following hold true:

Axiom: There exists at most one edge between two halo nodes.

Axiom: There exists at most one edge between any two nodes in the full hierarchy.

These axioms say that between any two nodes in the full hierarchy there is at most one subsumption relation, either formal or semantic.

In all core hierarchies we introduce an obligatory root core node. This node corresponds to the basic XML format, with only a root element. The definition of this node leads to the following theorems:

Theorem: The root core node of a hierarchy subsumes all nodes in the hierarchy.

Theorem: The core hierarchy is a connected graph (disregarding the orientation of the edges).

In order to guarantee the connectivity of the full hierarchy, we introduce the following axiom:

Axiom: From each halo node there is at least one core node which can be reached (disregarding the orientation of the edges).

This axiom basically says that no halo node is disconnected from the core hierarchy: in order for an annotation format to be included as a node in the hierarchy, there has to be at least one other node already in the hierarchy which it subsumes (or by which it is subsumed), through either formal subsumption (introduced in 2.1) or semantic subsumption. The previous theorem and this axiom lead to the next theorem:

Theorem: The full hierarchy is a connected graph (disregarding the orientation of the edges).

The proof is direct: the core nodes are connected (as shown by the previous theorem) and each halo node is connected to at least one core node. This means that from each halo node all core nodes can be reached. Since, as a corollary to the previous axiom, every halo node can be reached from at least one core node, from each core node all other nodes can be reached. It then follows that from each halo node all other nodes can be reached. Thus, the full hierarchy is a connected graph.

3 The hierarchy augmented with processing power

3.1 Adding processing power to the hierarchy

If there is an edge (either core or halo) between nodes A and B, there should be a process which takes as input a file observing the restrictions imposed by node A and produces as output a file observing the restrictions imposed by node B. While doing this type of processing, the module might also make use of some additional resources outside the
hierarchy, such as language models and lexicons.

A graph of annotation schemas on which processing modules have been marked on edges is called augmented with processing power (or simply, augmented). An edge to which at least one processing module is attached is called a processing edge. A single edge in the graph can have multiple processing modules attached to it, if those modules observe the same restrictions regarding their input and output formats.

If there is no known processing module attached to an edge, that edge is called a carrier edge. Because of the way processing paths are computed in the hierarchy, all edges in the hierarchy can be considered carrier edges and can be included as such in the resulting processing flows.

A node to which a single carrier edge points (and no other edges) is called an isolated node.

3.2 The processing edges

Each node in the hierarchy has a unique identification key (ID), introduced to ensure an unambiguous reference to each node.

The edges, whether processing or carrier, are denoted by $E(A,B)$, where A is the ID of the input (origin) node and B is the ID of the output (destination) node. This notation is unique for each edge in the hierarchy, due to the uniqueness of the node IDs and the restriction that between any two nodes there can exist only one edge.

[Figure 3: A processing edge E(A,B) and the attached processing modules p1, p2, …; each module takes an input observing schema A, may use additional resources, and produces an output observing schema B]

The processing modules are denoted by $p_i(A,B)$, where $i$ is the unique index of the processing module, A is the origin node of the edge to which the module is attached and B is the destination node of that edge.

Processing modules will be used to introduce a series of operations on the augmented hierarchy. A processing module will be regarded as part of a
more complex operation performed on an input file in order to produce an output file. These complex operations are called flows in ALPE and are detailed in section 3.4. The operands of these operations (as well as of the processing modules, when taken as functions in an expression) are either nodes in the hierarchy (representing files observing the schemas corresponding to those nodes) or other processing modules applied to an input node (resulting in another node).

3.3 Conversion edges and synclouds

An edge connecting a halo node with another node of the full hierarchy is called a conversion edge. All conversion edges have at least one attached processing module. That module is basically a wrapper capable of converting one format into another. Part of the information encoded in the source node is rewritten in another format in the destination node. No new information is added.

If both source and destination nodes are core nodes, this means that at least one XML element or attribute is written in another way (change of element name, conversion of an attribute into an element, merging of attributes or elements, etc.). If one of the connected nodes is in the halo, the module attached to the conversion edge rewrites either XML into a non-XML format or the other way around, depending on the direction of the edge. Conversion edges are created when the corresponding processing modules are integrated in the hierarchy.

[Figure 4: A syncloud (panel labels: Core hierarchy, Halo)]

All nodes that can be reached from the same node using only conversion edges (disregarding the direction of the edges) form a syncloud (synonymy cloud). All nodes in the same syncloud contain the same information, but encoded differently. As shown in Figure 4, part of the syncloud can be found in the halo, and part in one node belonging to the core hierarchy.

3.4 Operations on the hierarchy

As shown elsewhere [1, 2], two types of operations complement the model:
- Hierarchy building operations: initialization
(initializes a new hierarchy), classification (finds the place of an input XML schema in an existing hierarchy) and integration (adds a new processing module to the hierarchy);
- Processing operations: pipelining (combines two individual modules on the same sub document, the second in sequence using the output of the first one), simplification (simplifies an annotated file to the format of another, subsuming format in the hierarchy) and merging (merges two or more annotations applied to the same hub document into one).

Flows are defined on the augmented hierarchy as processing operations of the above types, performed over an input file corresponding to the annotation described in a node in the hierarchy (or a node, for simplification). A processing flow can be defined as any combination of pipeline, simplify and merge operations and void flows, applied to nodes.

Processing flows can be computed automatically for an input file and a specified output format. The process of automatic computation of flows is partially described in [1, 2].

3.5 Flows and synclouds

The main benefit of introducing synclouds in the ALPE hierarchy model is the accommodation of the various available processing modules that use non-XML documents as either input or output.

[Figure 5: Flows in synclouds (panels a, b, c)]

A simple computed flow, on the left of Figure 5, shows only the general picture of an ALPE hierarchy, where a node identifies a distinct annotation format. Figure 5a depicts the case in which the processing modules considered on the left are actual modules between the core nodes of the synclouds.

Figure 5b shows that a flow can include processing edges between XML and non-XML formats in a syncloud. These modules can be pipelined with XML processing modules using wrappers attached to edges between the core nodes and other nodes in the syncloud.

Also, as in Figure 5c, flows can exist entirely outside core nodes and include only processing modules using non-XML intermediate formats, allowing the
straightforward integration in ALPE of flows produced by other systems, such as GATE or UIMA, with the only change being the addition of the input/output wrappers.

4 Conclusions and further work

ALPE is intended to offer compatibility with other similar systems, such as GATE and UIMA. The proposed model allows seamless integration of annotation formats, processing modules and processing flows, requiring at most the existence of wrappers if the annotation formats involved are not encoded as XML core nodes. Since UIMA is the most prominent system comparable to ALPE, and since both GATE and UIMA are now open-source, we are studying the possibility of integrating ALPE in either system. This will be done after ALPE is fully implemented (it is now in the final stages of implementation). Once a proof of concept is available for ALPE, we plan to attract other researchers interested in developing it further, either as a stand-alone system or integrated in broader-scope systems such as UIMA.

Adopting ALPE as a management and access environment for the resources employed and developed in a computational linguistics project proposing the development of multilingual resources and tools, such as CLARIN⁶ [6], has the potential to benefit both the project and the interested user. One important further development of ALPE will be a web service allowing users to build, configure and use ALPE hierarchies on the web, either as a limited, password-protected resource or as a global collection of linguistic resources. This type of hierarchy is able to manage multilingual resources as well as resources which require a fee to be paid before usage. Each user will be able to contribute their own tools and annotated resources, as well as to use processing chains adapted to their specifications, in terms of both input and output formats and cost and performance issues.

Standards usually appear late. In order for an annotation convention to become a standard, it should be adopted by a community of people. This happens over a
long interval of time, at the moment when a theory has passed the delicate test of empirical validation. Therefore, there is a strong need to accept new formats, which should work together with the well-accepted ones. We need a mechanism able to “understand” the notations, to detect the semantics behind them, to infer their meaning and to establish semantic links between new formats before standards appear. The precision of this automatic discovery cannot be expected to be extremely high, since the inference techniques also have to include heuristics, a little bit of AI. When a new mechanism is invented and there is no standard in the hierarchy to describe it, these heuristics aim at finding the right place in the hierarchy for the new schema.

⁶ http://www.mpi.nl/clarin/

The first step would be a clear definition of the semantics of a standard. A promising new model for describing annotation semantics, the Linguistic Annotation Framework [5], has the potential to clearly define semantic links between various annotation formats. We are currently in the process of integrating a version of this model as a way to formally describe nodes in the ALPE core hierarchy and as a possible basis for the automatic detection of semantic links between formats.

This paper touches on the above aspect, as it proposes a model for integrating XML schemas with non-standardized formats. Non-XML formats are organized in synclouds, clusters of formats around recognized XML standards. Processing flows can be computed automatically as combinations of processing modules and wrappers.

References

1. Cristea, D., Forăscu, C., Pistol, I.: Requirements-Driven Automatic Configuration of Natural Language Applications. In: Bernadette Sharp (Ed.): Proceedings of the 3rd International Workshop on Natural Language Understanding and Cognitive Science - NLUCS 2006, in conjunction with ICEIS 2006, Cyprus, Paphos, May 2006. INSTICC Press, Portugal. ISBN: 972-8865-50-3 (2006)
2. Cristea, D., Pistol, I.: Managing
Language Resources and Tools Using a Hierarchy of Annotation Schemas. Proceedings of the Workshop on Sustainability of Language Resources, LREC-2008, Marrakech (2008)
3. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the ACL (ACL’02), Philadelphia, US (2002)
4. Ferrucci, D., Lally, A.: UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering 10, No. 3-4, 327-348 (2004)
5. Romary, L., Ide, N.: International Standard for a Linguistic Annotation Framework. Natural Language Engineering 10, 3-4 (09/2004), 211-225 (2007)
6. Váradi, T., Krauwer, S., Wittenburg, P., Wynne, M., Koskenniemi, K.: CLARIN: Common Language Resources and Technology Infrastructure. Proceedings of LREC-2008, Marrakech (2008)
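
As an illustration of the formal subsumption relation of section 2.1 and of the automatic computation of flows mentioned in section 3.4, a minimal Python sketch may be helpful. It is not taken from the ALPE implementation. The schema model (a mapping from tag names to attribute sets), the tag and module names, the edge set assumed for Figure 1 and the use of breadth-first search are all assumptions made only for this example.

```python
from collections import deque

# Hypothetical schema model: tag name -> set of attribute names.
A = {"doc": set()}
B = {"doc": set(), "w": {"id"}}
C = {"doc": set(), "s": {"id"}}
D = {"doc": set(), "w": {"id", "pos"}}


def subsumes(a, b):
    """Formal subsumption: T(a) is included in T(b), and for every tag t of a
    the attributes of t in a are included in the attributes of t in b."""
    return all(tag in b and attrs <= b[tag] for tag, attrs in a.items())


def compute_flow(edges, start, goal):
    """Return a shortest list of edges leading from start to goal, or None.

    edges maps (origin ID, destination ID) to the list of attached modules;
    an empty list would mark a carrier edge. Breadth-first search is an
    assumption of this sketch, not ALPE's documented algorithm.
    """
    successors = {}
    for origin, destination in edges:
        successors.setdefault(origin, []).append(destination)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            # Turn the node sequence into the sequence of traversed edges.
            return list(zip(path, path[1:]))
        for nxt in successors.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# Assumed edges for the core hierarchy of Figure 1, with made-up module names.
EDGES = {
    ("A", "B"): ["tokenizer"],
    ("A", "C"): ["sentence_splitter"],
    ("B", "D"): ["pos_tagger"],
}

if __name__ == "__main__":
    print(subsumes(A, D), subsumes(D, A))
    print(compute_flow(EDGES, "A", "D"))
```

In this sketch a flow is only a chain of edges E(A,B); selecting among several modules attached to the same edge, for instance by cost or IPR conditions, is left out.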